The WiLI benchmark dataset for written natural language identification

Author

Martin Thoma

Abstract

This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia. It contains 1000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language it is.
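To make the classification task concrete, here is a minimal sketch of a character n-gram language identifier. It is not the method of the paper: the toy training sentences, the label strings `eng` and `deu`, and the helper names `char_ngrams`, `train`, `cosine`, and `identify` are all assumptions made for illustration, and a real experiment would train on the dataset's paragraphs instead.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def train(samples, n=3):
    """Build one n-gram count profile per label from (paragraph, label) pairs."""
    profiles = {}
    for paragraph, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(paragraph, n))
    return profiles


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(paragraph, profiles, n=3):
    """Return the label whose profile is most similar to the paragraph."""
    grams = char_ngrams(paragraph, n)
    return max(profiles, key=lambda label: cosine(grams, profiles[label]))


# Hypothetical toy data; the labels are illustrative, not taken from the dataset.
toy_samples = [
    ("The quick brown fox jumps over the lazy dog and runs through the field.", "eng"),
    ("Der schnelle braune Fuchs springt über den faulen Hund und läuft durch das Feld.", "deu"),
]
profiles = train(toy_samples)
guess_en = identify("The dog runs over the field", profiles)
guess_de = identify("Der Hund läuft über das Feld", profiles)

# Dataset size as stated in the abstract: 235 languages x 1000 paragraphs each.
total_paragraphs = 235 * 1000
```

With one profile per language, the same `identify` call would extend to all 235 classes; only the training pairs change.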

Similar resources

Incorporating Dialectal Variability for Socially Equitable Language Identification

Language identification (LID) is a critical first step for processing multilingual text. Yet most LID systems are not designed to handle the linguistic diversity of global platforms like Twitter, where local dialects and rampant code-switching lead language classifiers to systematically miss minority dialect speakers and multilingual speakers. We propose a new dataset and a character-based sequ...

TweetLID: a benchmark for tweet language identification

Language identification, the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues remain unresolved: (i) distinction of similar languages, (ii) detection of multilingualism in a single document, and (iii) identifying the language of short texts. In this paper, we describe our work on the development of a...

UCOM offline dataset-an urdu handwritten dataset generation

A benchmark database for character recognition is an essential part of efficient and robust development. Unfortunately, there is no comprehensive handwritten dataset for the Urdu language that could be used to compare state-of-the-art techniques in the field of optical character recognition. In this paper, we present a new and publicly available dataset comprising 600 pages of handwritten Ur...

HARRISON: A Benchmark on HAshtag Recommendation for Real-world Images in Social Networks

Simple, short, and compact hashtags cover a wide range of information on social networks. Although many works in the field of natural language processing (NLP) have demonstrated the importance of hashtag recommendation, hashtag recommendation for images has barely been studied. In this paper, we introduce the HARRISON dataset, a benchmark on hashtag recommendation for real world images in socia...

Towards Technology Structure Mining from Text by Linguistics Analysis

This report introduces the task of Technology-Structure Mining to support Management of Technology. We propose a linguistic based approach for identification of Technology Interdependence through extraction of technology concepts and relations between them. In addition, we introduce Technology Structure Graph for the task formalization. While the major challenge in technology structure mining i...

Publication date: 2018
